Template Mining for Information Extraction from Digital Documents
Author: Gobinda G. Chowdhury
Abstract
With the rapid growth of digital information resources, information extraction (IE), the process of automatically extracting information from natural language texts, is becoming more important. A number of IE systems, particularly in the areas of news/fact retrieval and in domain-specific areas such as chemical and patent information retrieval, have been developed in the recent past using the template mining approach, a natural language processing (NLP) technique that extracts data directly from text if either the data or the text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template. This article briefly reviews template mining research. It also shows how templates are used in Web search engines such as Alta Vista, and in meta-search engines such as Ask Jeeves, to help end-users generate natural language search expressions. Some potential areas of application of template mining for extraction of different kinds of information from digital documents are highlighted, and how such applications are used is indicated. It is suggested that, in order to facilitate template mining, standardization in the presentation and layout of information within digital documents has to be ensured, and this can be done by generating various templates that authors can easily download and use while preparing digital documents.

Gobinda G. Chowdhury, Division of Information Studies, School of Applied Science, Nanyang Technological University, N4#2a-32, Nanyang Avenue, Singapore 639798
LIBRARY TRENDS, Vol. 48, No. 1, Summer 1999, pp. 182-208
© 1999 The Board of Trustees, University of Illinois

INFORMATION EXTRACTION AND TEMPLATE MINING
Information extraction (IE), the process of automatically extracting information from natural language texts, is gaining importance due to the fast growth of digital information resources.
Most work on IE has emerged from research into rule-based systems in natural language processing. Croft (1995) suggested that IE techniques, primarily developed in the context of the Advanced Research Projects Agency (ARPA) Message Understanding Conferences (MUCs), are designed to identify database entities, attributes, and relationships in full text. Gaizauskas and Wilks (1998) defined IE as the activity of automatically extracting pre-specified sorts of information from short natural language texts, typically, but by no means exclusively, newswire articles. Although work related to IE dates back to the 1960s, perhaps the first detailed review of IE as an area of research interest in its own right was by Cowie and Lehnert (1996). More recently, Gaizauskas and Wilks (1998) published a detailed review that divides the literature on IE into three groups: the early work on template filling, the Message Understanding Conferences (MUCs), and other work on information extraction.

Template mining is a particular technique used in IE. Lawson et al. (1996) defined template mining as a natural language processing (NLP) technique used to extract data directly from text if either the data or the text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to instructions associated with that template. Although different techniques are used for information extraction and knowledge discovery, as described by Cowie and Lehnert (1996), Gaizauskas and Wilks (1998), and Vickery (1997), template mining is probably the oldest information extraction technique. Gaizauskas and Wilks (1998) reported that templates were used to extract data from natural language texts, against which “fact retrieval” could be carried out, in the Linguistic String Project at New York University, which began in the mid-1960s and continued into the 1980s (reported by Sager, 1981).
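The definition quoted above (text that matches a template triggers the extraction instructions attached to that template) can be illustrated with a minimal Python sketch. It is not taken from Lawson et al. or any of the systems reviewed here: the `Template` class, the `mine` function, and the author-year citation pattern are all hypothetical names and choices made for illustration, and a regular expression stands in for whatever pattern language a real system would use.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Template:
    """A recognizable pattern plus the instructions for turning a match into data."""
    name: str
    pattern: re.Pattern
    extract: Callable[[re.Match], Dict[str, str]]


# Hypothetical template for in-text citations of the form "Author (Year)",
# "Author et al. (Year)" or "Author and Author (Year)".
CITATION = Template(
    name="author_year_citation",
    pattern=re.compile(
        r"\b([A-Z][a-z]+(?: et al\.| and [A-Z][a-z]+)?) \((\d{4})\)"
    ),
    extract=lambda m: {"authors": m.group(1), "year": m.group(2)},
)


def mine(text: str, templates: List[Template]) -> List[Dict[str, str]]:
    """Apply every template to the text and collect one record per match."""
    records = []
    for template in templates:
        for match in template.pattern.finditer(text):
            record = {"template": template.name}
            record.update(template.extract(match))
            records.append(record)
    return records


sample = (
    "Lawson et al. (1996) defined template mining; "
    "see also Cowie and Lehnert (1996)."
)
records = mine(sample, [CITATION])
for record in records:
    print(record)
```

Keeping the pattern and the extraction instructions together in one object mirrors the wording of the definition; a production system would need far more robust patterns than this single expression.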
Numerous studies, most of them domain-specific, have used templates for extracting information from texts. This article briefly reviews some of these works. It also shows how templates are used for information retrieval purposes in major Web search engines like AltaVista (http://www.altavista.com). This discussion proposes that template mining has great potential in extracting different kinds of information from documents in a digital library environment. To justify this proposition, this article reports some preliminary tests carried out on digital documents, more specifically on some articles published in the D-Lib Magazine (http://www.dlib.org/dlib).

WORKS ON TEMPLATE MINING
Template mining has been used successfully in different areas: extraction of proper names by Coates-Stephens (1992), Wakao et al. (1996), and Cowie and Lehnert (1996); extraction of facts from press releases related to company and financial information in systems like ATRANS (Lytinen & Gershman, 1986), SCISOR (Jacobs & Rau, 1990), JASPER (Andersen et al., 1992; Andersen & Huettner, 1994), LOLITA (Costantino, Morgan, & Collingham, 1996), and FIES (Chong & Goh, 1997); abstracting scientific papers by Jones and Paice (1992); summarizing new product information by Shuldberg et al. (1993); extraction of data from analytical chemistry papers by Postma et al. (1990a, 1990b) and Postma and Kateman (1993); extraction of reaction information from experimental sections of papers in chemistry journals by Zamora and Blower (1984a, 1984b); processing of generic and specific chemical designations from chemical patents by Chowdhury and Lynch (1992a, 1992b) and by Kemp (1995); and extraction of bibliographic citations from the full texts of patents by Lawson et al. (1996). Template mining has largely been used for extraction of information from news sources and from texts in a specific domain.
Gaizauskas and Wilks (1998) reported that applied work on filling structured records from natural language texts originated in two long-term research projects: the Linguistic String Project (Sager, 1981) at New York University and the research on language understanding and story comprehension carried out at Yale University by Schank and his associates (Schank, 1975; Schank & Abelson, 1977; Schank & Riesbeck, 1981). The first project was conducted in the medical science domain, particularly involving radiology reports and hospital discharge summaries, while the second led to many other research works in the early 1980s that used the principles and techniques of IE to develop practical applications such as the FRUMP system developed by De Jong (1982). FRUMP used a simplified version of SCRIPTS, proposed by Schank (Schank, 1975; Schank & Abelson, 1977; Schank & Riesbeck, 1981), to process text from a newswire source to generate story summaries. ATRANS (Lytinen & Gershman, 1986), another IE system, was soon developed and commercially applied. ATRANS used the script approach (Schank & Abelson, 1977; Schank & Riesbeck, 1981) for automatic processing of money transfer messages between banks. Another successful application of IE produced a commercial online news extraction system called SCISOR (Jacobs & Rau, 1990), which extracts information about corporate mergers and acquisitions from online news sources. JASPER (Andersen et al., 1992; Andersen & Huettner, 1994) was another IE system, developed for fact extraction for Reuters. JASPER uses a template-driven approach and partial analysis techniques to extract certain key items of information from a limited range of texts such as company press releases.
LOLITA (Costantino, Morgan, & Collingham, 1996) is a financial IE system that uses three pre-defined groups of templates designed according to a “financial activities approach,” namely, company related templates, company restructuring templates, and general macroeconomic templates. In addition, the user-definable template allows the user to define new templates using natural language sentences. Chong and Goh (1997) developed a similar template-based financial information extraction system, called FIES, that extracts key facts from online news articles.

Applications of template mining techniques for automatic abstracting can be traced back to 1981, when Paice (1981) used what he called indicator phrases (such as “the results of this study imply that . . . ”) to extract topics and results reported in scientific papers for generating automatic abstracts. Paice continued this work to improve the technique and to resolve a number of issues in natural language processing (see, for example, Jones & Paice, 1992; Paice & Husk, 1987). Shuldberg et al. (1993) described a system that digests large volumes of text, filtering out irrelevant articles and distilling the remainder into templates that represent information from the articles in simple slot/filler pairs. The system consists of a series of programs, each of which contributes information to the text to help determine which strings constitute appropriate values for the slots in the template.

Chemical and patent information systems have been prominent areas for the application of templates for IE. TICA (Postma et al., 1990a, 1990b; Postma & Kateman, 1993) used templates to extract information from the abstracts of papers on inorganic titrimetric analysis. The parsing program used in TICA followed an expectation-driven approach in which words or groups of words expect other words or concepts to appear.
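The indicator-phrase idea and the slot/filler representation described above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the two indicator patterns, the slot names `result` and `topic`, and the function `fill_slots` are invented for this example and are not the phrases or data structures used by Paice or by Shuldberg et al.

```python
import re
from typing import Dict, List

# Hypothetical indicator phrases, each paired with the slot it fills.
# The text following the phrase, up to the next full stop or semicolon,
# is taken as the filler.
INDICATORS = [
    ("result", re.compile(
        r"the results of this study (?:imply|show|suggest) that (.+?)[.;]",
        re.IGNORECASE)),
    ("topic", re.compile(
        r"this (?:paper|article) (?:describes|reviews|presents) (.+?)[.;]",
        re.IGNORECASE)),
]


def fill_slots(text: str) -> Dict[str, List[str]]:
    """Return slot/filler pairs: each slot maps to the fillers found for it."""
    slots: Dict[str, List[str]] = {slot: [] for slot, _ in INDICATORS}
    for slot, pattern in INDICATORS:
        for match in pattern.finditer(text):
            slots[slot].append(match.group(1).strip())
    return slots


abstract = (
    "This article reviews template mining research. "
    "The results of this study imply that standard layouts help extraction."
)
filled = fill_slots(abstract)
print(filled)
```

Stopping the filler at the first full stop is a deliberate simplification; it would cut a sentence short at an abbreviation or a decimal number, which is one of the issues a real system has to handle.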
Zamora and Blower (1984a, 1984b) developed a system that automatically generates reaction information forms (RIFs) from the descriptions of syntheses of organic chemicals in the Journal of the American Chemical Society. The techniques explored in the semantic phase of this work include the use of a case grammar and frames (Schank & Abelson, 1977; Schank & Riesbeck, 1981) to map the surface structure of the text into an internal representation from which the RIFs can be formed. Following the same methodology, Ai et al. (1990) developed a system that generates a summary of all preparative reactions from the experimental sections of the Journal of Organic Chemistry papers. This work identified seven sequences of events that were used for building templates for the text of an experimental paper.

Chowdhury and Lynch (1992a, 1992b) developed a template-based method for converting to GENSAL (a generic structure language developed at the University of Sheffield) those parts of the Derwent Documentation Abstracts that specify generic chemical structures. Templates for processing both the variable and multiplier expressions, which predominate in the assignment statements in the Derwent Documentation Abstracts, were identified for further processing. As part of this research, Chowdhury (1992) also conducted a preliminary discourse analysis of European chemical patents that identified the common patterns of expressions occurring in different parts of patent texts. This work prompted further research (Kemp, 1995; Lawson et al., 1996) leading to the use of template mining in the full text of chemical patents. Lawson et al. (1996) reported their work using the template mining approach to isolate and automatically extract bibliographic citations to patents, journal articles, books, and other sources from the full texts of English-language patents.
Some work also examines the development of specific tools and techniques for information extraction using templates. For example, Sasaki (1998) reported an ongoing project on building an information extraction system that extracts information from a real-world text corpus such as newspaper articles and Web pages. As part of this project, an inductive logic programming (ILP) system has been developed to generate IE rules from examples. Gaizauskas and Humphreys (1997) described the approach taken to knowledge representation in the LaSIE information extraction system, particularly the knowledge representation formalisms, their use in the IE task, and how the knowledge represented in them is acquired. LaSIE first translates individual sentences to a quasi-logical form and then constructs a discourse model of the entire text from which template fills are derived. Guarino (1997) argued that the task of information extraction can be seen as a problem of semantic matching between a user-defined template and a piece of information written in natural language. He further suggested that the ontological assumptions of the template need to be suitably specified and compared with the ontological implications of the text. Baralis and Psaila (1997) argued that the current approaches to data mining usually address specific user requests, while no general design criteria for the extraction of association rules are available for the end-user. To solve this problem, they proposed a classification of association rule types that provides a general framework for the design of association rule mining applications, and predefined templates as a means to capture the user specification of mining applications.
Although numerous research projects have been undertaken, and some are currently ongoing, Croft (1995) suggested that the current state of information extraction tools is such that it requires a considerable investment to build a new extraction application, and certain types of information are very difficult to identify. However, Croft further commented that extraction of simple categories of information is practical and can be an important part of a text-based information system. This article highlights some potential areas of application of template mining in a digital library environment.

USE OF TEMPLATES IN WEB SEARCH ENGINES
Gaizauskas and Wilks (1998) suggested that there is a contrast between the aims of information extraction and information retrieval systems, in the sense that IR retrieves relevant documents from collections, while IE extracts relevant information from documents. However, the two are complementary, and their use in combination has the potential to create powerful new tools in text processing and retrieval. Indeed, IE and IR are equally important in the electronic information environment, particularly the World Wide Web, and templates have been used both for IR and IE. Many applications of template mining mentioned above handle digital texts available on the Web, while search engines use templates to facilitate IR. Search engines are among the most essential tools on the Internet: they help find Web sites relating to a particular subject or topic. Search engines are basically huge databases containing millions of records, each of which includes the URL of a particular Web page along with information relating to the content of the Web page supplied in the HTML by the author. A search engine obtains this information either through a submission from the author or by “crawling” the Internet with “robot crawlers.”
The most popular search engines include AltaVista, Excite, Hotbot, Infoseek, Lycos, Webcrawler, and Yahoo. Some search engines use templates to help end-users submit natural language queries, which the engines then use to search on specific topics. Two small sets of tests were conducted to see how this is done in a large search engine, AltaVista, and in a meta-search engine, Ask Jeeves. The following section shows how these search engines use templates for natural language query formulation in their interfaces.

USE OF TEMPLATES IN ALTA VISTA
The Alta Vista search engine (http://www.altavista.com) helps users find information on the Web. One interesting feature of this search engine is that a user can enter one or more search terms/phrases or can type a natural language statement such as “What is the capital of Alaska?” or “Where can I find quotations by Ingmar Bergman?” Taking the second option, a simple query statement, “Where can I find information on Web search engines?” was typed in the specified box of the Alta Vista search interface (see Figure 1). Along with the results, Alta Vista came up with two templates that contain natural language sentences related to the search topic (see Figure 2). By clicking on the box at the end of the statement “How do I (Internet skill)?” the system shows a box containing various options (Figure 3), any of which can be chosen to complete the sentence, the default one being “search through ALL web sites.” By choosing this, or any other option from the box, a user can formulate a sentence-like query such as “How do I search through all Web sites?” or “How do I learn HTML?” or “How do I use the Internet as a telephone?” and so on.
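The query template described above, a fixed sentence frame with one slot whose value is picked from a list of options, can be modeled with a short Python sketch. How Alta Vista implemented this feature is not stated in the article, so the `QueryTemplate` class and its `render` method are hypothetical; only the frame “How do I ...?” and the three option strings come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryTemplate:
    """A natural language sentence frame with a single slot to fill."""
    frame: str          # sentence containing one "{slot}" placeholder
    options: List[str]  # values offered to the user; the first is the default

    def render(self, choice: int = 0) -> str:
        """Complete the sentence with the chosen option (default: the first)."""
        return self.frame.format(slot=self.options[choice])


# Frame and options taken from the example in the text.
HOW_DO_I = QueryTemplate(
    frame="How do I {slot}?",
    options=[
        "search through all Web sites",
        "learn HTML",
        "use the Internet as a telephone",
    ],
)

default_query = HOW_DO_I.render()
all_queries = [HOW_DO_I.render(i) for i in range(len(HOW_DO_I.options))]
print(default_query)
```

Restricting the slot to a closed list of options means every query the user can build is one the engine already knows how to answer, which is presumably the point of offering templates rather than free text.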
Journal title:
- Library Trends
Volume 48, Issue
Pages -
Publication date 1999